Main Question
Which chemical properties influence the quality of white wines?
Dataset Source Citation:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
## [1] 4898 12
The dataset has 4898 rows and 12 columns.
We can see that most white wines (\(92.6\%\)) were rated 5, 6, or 7; a few (\(6.9\%\)) were rated 4 or 8; and very few (\(0.5\%\)) were rated 3 or 9.
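Assuming the data frame is named `whites`, as in the code later in this report, this distribution can be checked directly:

```r
# fraction of wines at each quality level (fractions sum to 1)
round(prop.table(table(whites$quality)), 3)
```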
The mean of each variable is shown as a red line in each plot. We can see that residual.sugar, chlorides, free.sulfur.dioxide, and density have outliers.
The most strongly correlated pair is density and residual.sugar. Many pairs have correlation greater than 0.4 or less than -0.4; we will investigate these later.
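One way to list those strongly correlated pairs (a sketch, assuming `whites` is already loaded; non-numeric columns are dropped first):

```r
# correlation matrix of the numeric columns
cors <- cor(whites[sapply(whites, is.numeric)])
# zero out the lower triangle and diagonal so each pair appears once
cors[lower.tri(cors, diag = TRUE)] <- 0
# row/column indices of the strongly correlated pairs
which(abs(cors) > 0.4, arr.ind = TRUE)
```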
First we normalize the independent variables and treat quality as a numeric variable, then fit a linear model.
library(caret)
normalize <- function(vals) {
  # center each value of a vector to mean 0 and scale to standard deviation 1
  avg <- mean(vals)
  stdev <- sd(vals)
  (vals - avg) / stdev
}
# normalize all columns of whites except quality
whites2 <- data.frame(sapply(whites[, -12], normalize))
whites2$quality <- as.numeric(levels(whites$quality))[whites$quality]
# set random seed so that the training and testing datasets are reproducible
set.seed(1111)
# split our dataset whites into training and testing sets
inTrain <- createDataPartition(y=whites2$quality, p=0.6, list=FALSE)
# training2 for building model, testing2 for evaluating model
training2 <- whites2[inTrain, ]
testing2 <- whites2[-inTrain, ]
raw.linear.model <- train(quality~., data=training2, method="lm")
(Plot: actual vs. predicted quality on the testing dataset.) There are white wines with quality 3, 8, and 9 in the testing dataset, but none appear among the predictions, so the sensitivity for quality 3, 8, and 9 is 0.
Most (\(84\%\)) of the residuals are in [-1, 1], and the residual plot is symmetric, so the performance of our linear model is not bad.
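The residual summary above can be computed from the objects defined earlier; a minimal sketch:

```r
# predict on the held-out set and compute residuals
preds <- predict(raw.linear.model, newdata = testing2)
res <- testing2$quality - preds
# fraction of residuals in [-1, 1]
mean(abs(res) <= 1)
# residual plot
plot(preds, res, xlab = "predicted quality", ylab = "residual")
```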
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.88673982 0.01383152 425.603164 0.000000e+00
## fixed.acidity 0.06119533 0.02194627 2.788416 5.330879e-03
## volatile.acidity -0.17490650 0.01467734 -11.916772 5.257488e-32
## citric.acid -0.01602755 0.01473578 -1.087662 2.768339e-01
## residual.sugar 0.39060037 0.04607454 8.477574 3.592200e-17
## chlorides -0.02683517 0.01614450 -1.662186 9.658251e-02
## free.sulfur.dioxide 0.08928888 0.01860337 4.799607 1.669522e-06
## total.sulfur.dioxide -0.02393980 0.02019112 -1.185660 2.358527e-01
## density -0.39448991 0.06641662 -5.939626 3.193301e-09
## pH 0.09483609 0.01991806 4.761312 2.016882e-06
## sulphates 0.06885210 0.01466256 4.695776 2.778251e-06
## alcohol 0.26571971 0.03523450 7.541464 6.163048e-14
We see that the p-values for citric.acid and total.sulfur.dioxide are larger than 0.1, and the p-value for chlorides is borderline (\(\approx 0.097\)); together with fixed.acidity, these variables show the weakest linear relation with quality, so I remove them and fit a new linear model.
We can see from the plot that white wines with higher (alcohol + residual.sugar) tend to have higher quality.
We can see from the plot that white wines with lower density or lower volatile.acidity tend to have higher quality.
We can see from the plot that citric.acid and chlorides don’t have much effect on quality.
final.linear.model <- train(quality~volatile.acidity+residual.sugar+
free.sulfur.dioxide+density+pH+sulphates+
alcohol,
data=training2, method="lm")
From the residual plot we see that the performance of this model is very similar to the previous one, but it uses only 7 features.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.88715131 0.01385397 424.943380 0.000000e+00
## volatile.acidity -0.18323238 0.01401635 -13.072759 5.334585e-38
## residual.sugar 0.32805035 0.03446562 9.518191 3.552598e-21
## free.sulfur.dioxide 0.07174710 0.01512549 4.743456 2.201519e-06
## density -0.30060326 0.04676687 -6.427696 1.506941e-10
## pH 0.06347211 0.01472097 4.311679 1.673642e-05
## sulphates 0.06103093 0.01446811 4.218306 2.536071e-05
## alcohol 0.32481031 0.02840754 11.433948 1.191393e-29
Now all p-values for the independent variables are less than \(10^{-3}\).
From the plot of parameters, we see that alcohol, density, residual.sugar, and volatile.acidity affect quality most. Among them, alcohol and residual.sugar have a positive effect on quality, while density and volatile.acidity have a negative effect.
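The parameter plot can be reproduced from the fitted coefficients; a minimal sketch using the fitted model object:

```r
# coefficients of the reduced linear model, dropping the intercept
coefs <- coef(final.linear.model$finalModel)[-1]
# horizontal bar plot: larger absolute value means larger effect
barplot(sort(coefs), horiz = TRUE, las = 1,
        main = "Standardized linear model coefficients")
```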
From the residual plot we see that the performance of our linear model is not good enough, so we want to find a better machine learning algorithm. Since quality is actually an ordinal variable, it is more natural for our model to classify white wines into the quality levels 3-9. Thus I want to try classification algorithms.
library(caret)
# quality is a categorical variable, so transform it to a factor
whites <- transform(whites, quality=as.factor(quality))
# set random seed so that the training and testing datasets are reproducible
set.seed(2222)
# split our dataset whites into training and testing sets
inTrain <- createDataPartition(y=whites$quality, p=0.1, list=FALSE)
# will use training to build our model
training <- whites[inTrain, ]
# will use testing to evaluate our model
testing <- whites[-inTrain, ]
# some candidate machine learning methods
methods <- c("bdk", "ctree", "dnn", "earth", "elm", "fda", "gbm",
"kernelpls", "kknn", "knn", "lda", "lvq", "mda", "pam", "pda",
"pls", "polr", "protoclass", "rf", "rpart", "sda", "simpls",
"treebag")
accuracies <- c()
# get accuracy for each machine learning method
for (md in methods) {
  model <- train(quality~., data=training, method=md)
  # accuracy on the held-out testing set
  accuracy <- sum(testing$quality == predict(model, testing)) / nrow(testing)
  accuracies <- c(accuracies, accuracy)
}
data.frame(method=methods, accuracy=accuracies)
## method accuracy
## 1 bdk 0.4278257
## 2 ctree 0.4645937
## 3 dnn 0.4489333
## 4 earth 0.4945529
## 5 elm 0.4521108
## 6 fda 0.4945529
## 7 gbm 0.5156605
## 8 kernelpls 0.4577848
## 9 kknn 0.4738992
## 10 knn 0.4396278
## 11 lda 0.5163414
## 12 lvq 0.4192011
## 13 mda 0.5120291
## 14 pam 0.4489333
## 15 pda 0.5242851
## 16 pls 0.4577848
## 17 polr 0.5240581
## 18 protoclass 0.4062642
## 19 rf 0.5367680
## 20 rpart 0.5102133
## 21 sda 0.5177031
## 22 simpls 0.4577848
## 23 treebag 0.5081707
We find that rf (random forest) achieves the best accuracy on this problem, so we will use a random forest model.
# set random seed so that the training and testing datasets are reproducible
set.seed(3333)
# split our dataset whites into training and testing sets
inTrain <- createDataPartition(y=whites$quality, p=0.6, list=FALSE)
# will use training to build our model
training <- whites[inTrain, ]
# will use testing to evaluate our model
testing <- whites[-inTrain, ]
# Fit a random forest model
random.forest.model <- train(quality~., data=training, method="rf")
## rf variable importance
##
## Overall
## alcohol 100.000
## density 76.221
## volatile.acidity 58.249
## free.sulfur.dioxide 45.756
## total.sulfur.dioxide 42.717
## residual.sugar 34.999
## chlorides 27.004
## pH 22.284
## citric.acid 14.915
## sulphates 7.191
## fixed.acidity 0.000
We see that alcohol is the most important variable, meaning that alcohol affects the quality of white wine most; density and volatile.acidity also have a large effect on quality.
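The importance table above is the kind of output produced by caret's `varImp`:

```r
# scaled variable importance for the random forest model (0-100)
varImp(random.forest.model)
```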
## Sensitivity Specificity
## Class: 3 0.0000000 1.0000000
## Class: 4 0.1846154 0.9978870
## Class: 5 0.6615120 0.8764535
## Class: 6 0.7997725 0.6163114
## Class: 7 0.4687500 0.9489415
## Class: 8 0.3142857 0.9994703
## Class: 9 0.0000000 1.0000000
Quality 3 (the lowest) and quality 9 (the highest) have 0 sensitivity, which means no wine of quality 3 or 9 was correctly recognized by the model. This is because too few (\(\approx 0.5\%\)) white wines are rated 3 or 9.
We see that most of our predictions (\(\approx 66\%\)) are correct, some (\(\approx 31\%\)) have an error of 1 or -1, and very few (\(\approx 3\%\)) are off by 2 or more in either direction. Therefore this model does a good job.
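That error breakdown can be computed by comparing predicted and true quality on the testing set (a sketch; quality is a factor here, so it is converted back to numbers through its labels):

```r
# prediction error per wine: predicted quality minus true quality
preds <- predict(random.forest.model, newdata = testing)
err <- as.numeric(as.character(preds)) - as.numeric(as.character(testing$quality))
# fraction of predictions at each error value
round(prop.table(table(err)), 3)
```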
We already know that alcohol, density, and volatile.acidity affect quality most, but we don't know whether their effects are positive or negative, so we create some box plots to find out.
We see that from quality 5 to 9, the average alcohol is increasing. In other words, for quality above 4, white wines with higher alcohol tend to have higher quality.
We can see that for quality above 4, white wines with lower density tend to have higher quality.
It is not clear from the box plot how volatile.acidity affects quality.
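Box plots like the ones discussed above can be reproduced with base R, for example:

```r
# distribution of alcohol within each quality level
boxplot(alcohol ~ quality, data = whites,
        xlab = "quality", ylab = "alcohol")
```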
Among all 11 physicochemical variables for white wine, residual.sugar and density are the most strongly correlated pair; their correlation is 0.84.
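This correlation can be verified directly:

```r
# correlation between residual sugar and density
cor(whites$residual.sugar, whites$density)
```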
This plot shows the parameters of our linear model. Since all features are normalized, the larger the absolute value of a parameter, the larger the effect of the corresponding feature. We see that alcohol and residual.sugar have the strongest positive effects on white wine quality, while density and volatile.acidity have the strongest negative effects.
We also tried several other machine learning methods and found that random forest performs best. In our random forest model, alcohol, density, and volatile.acidity affect the quality of white wines most. From the box plots we see that alcohol has a positive effect on white wine quality, while density has a negative effect.
In summary, alcohol influences white wine quality most: wines with higher alcohol tend to have higher quality. Density and volatile.acidity also influence quality strongly: wines with lower density and lower volatile.acidity tend to have higher quality.
I divided the dataset into training and testing sets to avoid overfitting. I first tried lm (linear regression), but linear regression requires a numeric dependent variable, so I transformed quality to numeric. At first I simply used as.integer(quality) and found it is not correct for a factor; I eventually found the correct conversion on Stack Overflow. From the residual plot I found the performance of the linear model was not very good, so I also tried many other machine learning methods such as ctree, fda, and rf. I found that rf (random forest) had the highest accuracy, so I used it in the end. One problem shared by the random forest model and the linear model is that quality 3 and quality 9 have 0 sensitivity; I think this is because there are too few (\(\approx 0.5\%\)) observations with quality 3 or 9. The models can still achieve high accuracy without ever predicting quality 3 or 9, but in daily life people care most about the best wines, so I am not satisfied that neither model can identify the highest-quality wines. One way to improve sensitivity for the highest quality is to collect more data about the best white wines. Another way may be to duplicate the existing highest-quality data: for example, the dataset contains only 5 white wines with the highest quality, and we could duplicate them so that there are 10, 20, or more such rows.
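The duplication (oversampling) idea could be sketched like this, assuming `training` still holds the factor-coded data:

```r
# replicate the rare highest-quality rows a few times before refitting
rare <- training[training$quality == "9", ]
training.oversampled <- rbind(training,
                              rare[rep(seq_len(nrow(rare)), times = 3), ])
# check the new class counts
table(training.oversampled$quality)
```

Note that duplicating rows only makes the model see the rare class more often; it adds no new information, so evaluation should still use the untouched testing set.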